Abstract

In multi-domain proteins, the domains are connected by a flexible
unstructured region called a protein domain linker. The accurate
demarcation of these linkers holds the key to understanding their biochemical
and evolutionary attributes. This knowledge helps in designing a
suitable linker for engineering stable multi-domain chimeric proteins. Here
we propose a novel method for the demarcation of the linker based on
a three-dimensional protein structure and a domain definition. The
proposed method is based on biological knowledge about the structural
flexibility of linkers. We performed structural analysis on a linker probable
region (LPR) around the domain boundary points of known SCOP domains.
The LPR was described using a set of overlapping peptide fragments of
fixed size. Each peptide fragment was then described by geometric
invariants (GIs) and subjected to a clustering process in which the fragments
corresponding to the actual linker come up as outliers. We then discover the
actual linkers by finding the longest continuous stretch of outlier
fragments in the LPRs. This method was evaluated on a benchmark dataset of
51 continuous multi-domain proteins, where it achieves an F1 score of 0.745
(0.83 precision and 0.66 recall). When the method was applied to 725
continuous multi-domain proteins, it was able to identify novel linkers that
were not reported previously. This method can be used in combination
with supervised or sequence-based linker prediction methods for accurate
linker demarcation.

1 Introduction 

Complex proteins are made up of several domains that work independently or in
tandem with the neighboring domains to perform the intended functions in
cellular processes [T5]. The domains are linked by means of flexible structures
known as domain linkers. The linkers play a key role in cooperative
inter-domain interactions, function regulation, protein stability, folding
rates, and domain-domain orientation [18] [16]. The linkers are known to
possess special biochemical properties, such as high solvent accessibility and
a typical amino acid composition, due to their role and location in the protein
structure. To further our understanding on these fronts, a systematic analysis
of the known linkers needs to be performed. Progress is hampered by the lack of
known and reliable linkers. For instance, there is no database of
experimentally characterized linkers, which would be of immense importance in
such studies. An improved understanding of linkers and their biochemical
properties is crucial in designing linkers for engineering stable multi-domain
proteins.

The most reliable and accurate linker demarcation can be obtained using
protein structure analysis. Crystallographers usually perform such analysis to
identify domains and linkers while determining the structure of multi-domain
proteins. However, in many cases the domain linkers are not reported
explicitly, and we need to employ computational methods to demarcate the linkers
based on the sequence or structure of the protein. State-of-the-art
sequence-based methods [37] 01 [11] can be used to identify a list of putative
linkers, which need to be processed further using available structural features
to determine the actual linkers. These methods take an amino acid sequence as
input and predict domain boundaries and linkers using a domain linker index
computed from amino acid propensities in the known linker regions. Miyazaki and
co-workers have proposed neural network [26] and support vector machine [12]
based techniques that use amino acid propensities to distinguish intra-domain
loops from inter-domain ones. Tanaka and co-workers used predicted secondary
structure in addition to amino acid propensities to identify loops, which are
further classified into linker and non-linker loops. Domain prediction
methods are also used to predict linkers by carving out a stretch of residues in
the inter-domain region around domain boundary points [HI [25]. These methods
tend to provide multiple linker predictions with a liberal allowance for the
linker boundaries and hence are not very useful for accurate protein linker
demarcation. Besides, most of these methods are unable to predict helical
linkers because they assume that linkers are loops.

George and Heringa conducted a systematic study of the biochemical
properties of linkers extracted from three-dimensional structures of
multi-domain proteins [16] [15]. They first identify structural domains using
Taylor's method [41] and then extract linkers by branching out from domain
boundaries until the branches become buried within the core of the domain or
until the branch becomes 40 residues long. This method takes into account the
biochemical properties of linkers for their demarcation without using any
structural features. It is well known that linkers assume unique structures due
to their placement in the protein structure [H [33], and this forms the basis of
our method. The proposed method performs accurate demarcation of linkers given a
three-dimensional structure and its domain definition. It first extracts a
linker probable region (LPR) around the domain boundary point and then performs
structural analysis of the LPR to demarcate the actual linker. We perform a
rigorous assessment of the method using a benchmark dataset of known linkers
extracted from the literature.

The rest of the paper is organized as follows: (i) The Method section explains 
the proposed technique in detail, (ii) The Results section presents findings and 
representative linkers demarcated by the proposed method. It also reports accu- 
racy of the method and its performance vis-a-vis other state of the art methods, 
(iii) The Discussion section documents key contributions of the method. 

2 Proposed method 

The method takes a set of protein structures and the corresponding domain 
definitions as input and identifies the corresponding domain linkers. 

It can be broadly divided into the following four steps, as depicted in Figure 
1, (i) Construction of linker probable regions (LPRs), (ii) Parameterization of 
LPRs, (iii) Generation of structure profiles by clustering LPRs, and (iv) De- 
marcation of actual linkers by applying dynamic programming on the structure 
profiles. Note that the current version of the method works only with continuous 
multi-domain proteins. The algorithm is given below: 

We explain each of these steps in greater detail in the rest of the section. 

2.1 Construction and parameterization of LPRs 

Lines 1-6 of our algorithm construct a set of linker probable regions (LPRs),
$\mathcal{R}$, from the input set $\mathcal{S}$ of protein structures along with
their domain definitions. We then represent each LPR, i.e. $r \in \mathcal{R}$,
using a set of overlapping tetrahedrons as described below.

Let $\mathcal{S}$ be the set of protein structures along with their domain
definitions. Each element of $\mathcal{S}$ is an ordered pair of a structure and
its domain definition. Note that since we consider only continuous multi-domain
proteins in our analysis, we can define domains using the position of the last
amino acid residue in the domain. We will refer to the position of the last
amino acid as the endpoint of that domain. The set $D$ in the ordered pair
$(S, D) \in \mathcal{S}$ specifies the endpoints of each domain in $S$. Thus, for
a given structure $S$ with $e$ domains, $D = \{d_1, d_2, \ldots, d_e\}$. Here
$d_j$ is the endpoint of domain $j$. Note that the first domain starts at the
first position and the last domain ends at the last position in the protein.
Any other domain $j$ with $j > 1$ starts at position $d_{j-1} + 1$ and ends at
position $d_j$ in the structure. With this background, we are in a position to
define the LPR.

DEFINITION 1: The linker probable region (LPR) between consecutive domains $i$
and $j$ of protein $s$ is a substructure starting at position $d_i - k + 1$ and
ending at $d_i + k$ in $s$. It is denoted as LPR($s$, $i$, $j$). Note that
LPR($s$, $i$, $j$) contains the end position $d_i$ of domain $i$ and its length
is $2k$. The parameter $k$ is chosen based on the average linker length as
reported in the literature [TSJ |3j5]. The LPR is the basic unit of our
analysis.

Algorithm 1 Linker Demarcation

Require: $\mathcal{S} = \{(S_1, D_1), (S_2, D_2), \ldots, (S_n, D_n)\}$: set of n
         protein structures with domain definitions; and $k$: number of
         positions from one domain to be included in the linker probable region
Ensure:  $\mathcal{L} = \{(S_1, L_1), (S_2, L_2), \ldots, (S_n, L_n)\}$
 1: for each $(S_i, D_i) \in \mathcal{S}$ do
 2:     $R$ = ExtractLPR($S_i$, $D_i$)
 3:     $T$ = DiscretizeLPR($R$)
 4:     $\mathcal{T} = \mathcal{T} + T$
 5:     $\mathcal{R} = \mathcal{R} + R$
 6: end for
 7: $\mathcal{X}$ = InvariantList()
 8: for each $T \in \mathcal{T}$ do
 9:     $\mathcal{X} = \mathcal{X}$ + InvariantGeneration($T$)
10: end for
11: $\mathcal{X}_z$ = Standardize($\mathcal{X}$)
12: $\mathcal{X}_{pc}$ = PCA($\mathcal{X}_z$)
13: $\mathcal{C}$ = Cluster($\mathcal{X}_{pc}$)
14: for each $C \in \mathcal{C}$ do
15:     $\mathcal{E}$ = AssignEval($C$)
16: end for
17: for each $T \in \mathcal{T}$ do
18:     $\mathcal{U}$ = ComputeSUS($T$, $\mathcal{E}$)
19: end for
20: for each $R \in \mathcal{R}$ do
21:     LPRProfile = ConstructProfile($T$, $\mathcal{U}$)
22:     $L$ = GetMaximalScoringSubsequence(LPRProfile)
23:     $\mathcal{L} = \mathcal{L} + L$
24: end for
25: return $\mathcal{L}$



EXAMPLE 1: Let $S_j$ be a protein structure with two domains. Let
$D_{S_j} = \{d_1, d_2\}$. $S_j$ has exactly one LPR, which starts at position
$d_1 - k + 1$ and ends at position $d_1 + k$.

We construct a set of LPRs, $\mathcal{R}$, by extracting LPRs from
$\mathcal{S}$, the input set of proteins and their domain definitions (line 2 of
the algorithm). Note that all the LPRs in $\mathcal{R}$ are of equal length
$2k$. The backbone structure of each LPR is approximated with its $C_\alpha$
coordinates [12]. Thus, $R$ is a set of $2k$ amino acid residues along with
their positions in the structure as given by the x, y, z coordinates. We now
describe the procedure DiscretizeLPR (line 3 of the algorithm). Each
$r \in \mathcal{R}$ is discretized into a sequence of $2k - 3$ overlapping
tetrapeptides $T = t_{1,2,3,4}, t_{2,3,4,5}, \ldots, t_{(2k-3),(2k-2),(2k-1),(2k)}$.
Note that consecutive tetrapeptides $t_i$ and $t_{i+1}$ in the sequence $T$
share an overlap of three amino acid residues. Each tetrapeptide in $T$ is added
to $\mathcal{T}$, which is a global set of tetrapeptides obtained by
discretizing LPRs (line 4 of the algorithm).
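
The two procedures described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names `extract_lprs` and `discretize_lpr` are hypothetical, the structure is assumed to be given as a list of C-alpha coordinates, endpoints are assumed to be 1-based, and boundaries lying too close to a chain terminus are simply skipped (a policy the text does not specify).

```python
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def extract_lprs(ca_coords: Sequence[Coord],
                 endpoints: Sequence[int],
                 k: int) -> List[List[Coord]]:
    """Return one LPR (2k C-alpha coordinates) per domain boundary.

    `endpoints` holds the 1-based endpoint d_j of every domain. The last
    endpoint is the end of the chain, so no LPR follows it.
    """
    lprs = []
    for d in endpoints[:-1]:
        start, end = d - k + 1, d + k      # 1-based, inclusive
        if start < 1 or end > len(ca_coords):
            continue                       # assumed policy: skip truncated LPRs
        lprs.append(list(ca_coords[start - 1:end]))
    return lprs


def discretize_lpr(lpr: Sequence[Coord]) -> List[Tuple[Coord, ...]]:
    """Slide a window of four residues over the LPR with a step of one."""
    return [tuple(lpr[i:i + 4]) for i in range(len(lpr) - 3)]


# Toy chain of 30 residues on a line, with domain endpoints 15 and 30.
chain = [(float(i), 0.0, 0.0) for i in range(1, 31)]
lprs = extract_lprs(chain, endpoints=[15, 30], k=6)
tetrapeptides = discretize_lpr(lprs[0])
print(len(lprs[0]), len(tetrapeptides))    # 12 residues, 2k - 3 = 9 tetrapeptides
```

With k = 6 the single LPR covers positions 10 to 21, and consecutive tetrapeptides share three residues, as stated in the text.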

Each tetrapeptide $t \in \mathcal{T}$ represents a tetrahedral geometry and is
described by a fixed suite of $g$ descriptors, which are invariant under
transformations such as rotation and translation [25] [23] [22]. These
descriptors are referred to as geometric invariants (GIs) in the subsequent
text. The suite of invariants was chosen after extensive trial and error on
training data to address the following two issues: (a) for superimposable
tetrapeptides, the invariants must be similar within a tolerance limit
$\delta$; and (b) for a pair of non-superimposable tetrapeptides $t_1$ and
$t_2$, there must be at least one geometric invariant such that $f(t_1)$ is not
similar to $f(t_2)$. Here $f$ is a function that calculates a specific GI. We
represent each tetrapeptide $t \in \mathcal{T}$ with a suite of fifteen GIs
(line 7). The detailed method for calculating these GIs is given in our previous
work O [42] (line 9).

1. Nine GIs are calculated based on the tetrahedral geometry of $t$: the
   signed volume and the perimeter of $t$, and the length of each edge in $t$.
   Since there are six edges in all, six GIs correspond to the edge lengths.
   One more invariant is computed as the sum of the distances of each
   vertex of $t$ from the centroid of all the vertices. Let $V_t$ be the set of
   all vertices in $t$. The i-th vertex $v_i \in V_t$ gives the x, y, z
   coordinate position of the i-th amino acid residue in $t$. The centroid is
   calculated as follows:

   $$c_t = \frac{1}{|V_t|} \sum_{v_i \in V_t} v_i$$

   and the sum of distances from the centroid is calculated as

   $$\sum_{v_i \in V_t} \lVert v_i - c_t \rVert$$

2. The remaining six invariants for $t$ are calculated by forming three
   triangles using the vertices in $V_t$. The three triangles are as follows:
   $(v_1, v_2, v_3)$, $(v_1, v_3, v_4)$ and $(v_1, v_2, v_4)$. We calculate the
   area and perimeter of each of these triangles, thus accounting for the six
   remaining invariants.
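
The fifteen invariants listed above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the authors' code: the function name `geometric_invariants` is hypothetical, the perimeter of the tetrahedron is assumed to be the sum of its six edge lengths, distances are assumed to be Euclidean, and the ordering of the output vector is arbitrary.

```python
import math
from itertools import combinations


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def signed_volume(v1, v2, v3, v4):
    """Signed volume of the tetrahedron; the sign flips under mirroring."""
    return _dot(_sub(v2, v1), _cross(_sub(v3, v1), _sub(v4, v1))) / 6.0


def triangle_area(a, b, c):
    n = _cross(_sub(b, a), _sub(c, a))
    return 0.5 * math.sqrt(_dot(n, n))


def geometric_invariants(t):
    """Return the 15 GIs of a tetrapeptide given as four (x, y, z) tuples."""
    v1, v2, v3, v4 = t
    edges = [math.dist(p, q) for p, q in combinations(t, 2)]   # six edge lengths
    perimeter = sum(edges)                                      # assumed definition
    centroid = tuple(sum(v[i] for v in t) / 4.0 for i in range(3))
    centroid_sum = sum(math.dist(v, centroid) for v in t)
    gis = [signed_volume(v1, v2, v3, v4), perimeter, *edges, centroid_sum]
    # Area and perimeter of the three triangles named in the text.
    for a, b, c in ((v1, v2, v3), (v1, v3, v4), (v1, v2, v4)):
        gis.append(triangle_area(a, b, c))
        gis.append(math.dist(a, b) + math.dist(b, c) + math.dist(c, a))
    return gis


unit = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
print(len(geometric_invariants(unit)))   # 15
```

Because every quantity is built from differences between vertices, the vector is unchanged when the tetrapeptide is translated or rotated, which is the property the text requires of a GI.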

Further, we standardize $\mathcal{X}$ to zero mean and unit standard deviation.
Let $\mathcal{X}_z$ be the set of standardized GIs for $\mathcal{T}$ (line 11).
We perform Principal Component Analysis (PCA) to remove correlations between
GIs [33]. PCA gives a new set of orthogonal dimensions, which are linear
combinations of the original dimensions (GIs in this case). We selected the
first $m$ significant principal components (PCs) to represent the tetrapeptides.
Let $\mathcal{X}_{pc}$ be the set of tetrapeptides represented using $m$ PCs
(line 12).

2.2 Structural profiling of LPRs 

Since the linkers are unstructured regions, we believe that the corresponding 
tetrapeptides share structural similarity with fewer other tetrapeptides. On 
the other hand, the tetrapeptides from non-linker region are expected to share 
structural similarity with a large number of other tetrapeptides. Our objective is 
to determine the groups of structurally similar tetrapeptides through clustering 
process and utilize this knowledge towards the demarcation of actual linkers. 
We perform clustering of the set of tetrapeptides $\mathcal{X}_{pc}$, which are
represented using $m$ PCs as explained earlier (line 13). The clustering is
carried out via the Matlab implementation of hierarchical agglomerative
clustering (HAC) [SS]. We use Euclidean distance as the measure of
dissimilarity and Ward linkage [38] for merging the nodes in the clustering
tree. The optimal cut in the resulting dendrogram is determined using the
inconsistency parameter, which leads to the discovery of a set of clusters
$\mathcal{C}$. The inconsistency parameter compares each link in the cluster
hierarchy with the adjacent links to determine the natural cluster divisions in
the dataset [38]. The clustering process assigns each tetrapeptide to exactly
one cluster.

Once the clustering process is over, we obtain the distribution of cluster
sizes, which is used to assign an e-value to each cluster based on its size
(lines 14-16). The e-value for a cluster $C \in \mathcal{C}$ with size $|C|$ is
calculated as $a / |\mathcal{C}|$, where $a$ is the number of clusters in
$\mathcal{C}$ with size greater than $|C|$ and $|\mathcal{C}|$ is the total
number of clusters. Note that the large clusters are expected to contain
tetrapeptides corresponding to the non-linker regions, while the smaller
clusters are more likely to contain tetrapeptides corresponding to the linker
region. The e-values are normalized to zero mean and unit standard deviation to
obtain the structural uniqueness score (SUS) of the cluster. The smallest SUS is
assigned to the largest clusters, while the largest SUS is assigned to the
singleton clusters. Thus, the SUS indicates the structural uniqueness of the
cluster and its propensity to be a part of the actual linker. Each tetrapeptide
in the cluster is assigned the SUS of that cluster (lines 17-19). The structural
profile of an LPR is represented using the SUS of its constituent tetrapeptides
(line 21).
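
The e-value and SUS assignment described above can be illustrated with a short sketch. Several details are assumptions here and are not stated in the text: the function name `structural_uniqueness_scores` is hypothetical, the normalization is taken over clusters (not over tetrapeptides), and the population standard deviation is used.

```python
from collections import Counter
from statistics import mean, pstdev


def structural_uniqueness_scores(labels):
    """Map each cluster label to its SUS.

    `labels` holds one cluster label per tetrapeptide. The e-value of a
    cluster is the fraction of clusters that are strictly larger than it.
    """
    sizes = Counter(labels)                       # label -> cluster size
    n_clusters = len(sizes)
    evalue = {
        label: sum(1 for other in sizes.values() if other > size) / n_clusters
        for label, size in sizes.items()
    }
    mu = mean(evalue.values())
    sigma = pstdev(evalue.values())               # assumption: population std
    if sigma == 0:
        return {label: 0.0 for label in evalue}   # all clusters equally sized
    return {label: (e - mu) / sigma for label, e in evalue.items()}


# Toy clustering: one large cluster, two medium ones and a singleton.
labels = ["a"] * 5 + ["b"] * 3 + ["c"] * 3 + ["d"]
sus = structural_uniqueness_scores(labels)

# Structural profile of a hypothetical LPR whose tetrapeptides fall in
# clusters a, a, b, d, d, c, a, a, a (repeats of "d" are for illustration only).
profile = [sus[label] for label in ["a", "a", "b", "d", "d", "c", "a", "a", "a"]]
```

In this toy case the largest cluster receives the lowest SUS and the singleton the highest, which matches the ordering described in the text.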
2.3 Protein domain linker demarcation

Given the structural profile of an LPR, we are interested in finding the longest
continuous stretch of tetrapeptides with the highest cumulative SUS. Note that
such tetrapeptides, being highly unstructured, appear as outliers in the
clustering process. Hence, the stretch is demarcated as a linker between the
domains. A naive search would enumerate all possible stretches in order to find
the one with the greatest cumulative SUS. We instead use the linear-time
dynamic programming algorithm proposed by Ruzzo and Tompa [?] (line 22).
The algorithm takes a sequence of real numbers as input and generates
non-overlapping, contiguous subsequences having the greatest total score. Here,
the algorithm takes the structural profile of an LPR as input, which is a
sequence of nine SUS scores of the constituent tetrapeptides
$(u_1, u_2, \ldots, u_{2k-3})$, where $u_i \in \mathbb{R}$. Let $Q$ be the set
of all possible contiguous subsequences of tetrapeptides. The cumulative SUS for
each subsequence is obtained by simply summing the SUS of the constituent
tetrapeptides. Let CumSUS($q$) be the function that gives the cumulative score
for the subsequence $q \in Q$. The GetMaximalScoringSubsequence procedure finds
the subsequence with the greatest cumulative SUS (line 22). We declare such a
subsequence to be a domain linker. In case of a tie, the subsequence closest to
the domain boundary is declared to be the domain linker. Thus,

$$L = \arg\max_{q \in Q} \mathrm{CumSUS}(q)$$
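
For a single best stretch, the search above reduces to the classic maximum-sum contiguous subsequence problem. The sketch below uses a simple Kadane-style linear scan, which is a swapped-in simplification and not the full Ruzzo and Tompa algorithm (that algorithm returns all maximal scoring subsequences). The function names are hypothetical, ties are broken in favor of the longer stretch, and the tie-break by proximity to the domain boundary is omitted.

```python
def best_stretch(profile):
    """Return (start, end, score) of the best contiguous stretch.

    Indices are 0-based and inclusive and refer to tetrapeptides in the
    structural profile. Among equally scoring stretches the longer one wins.
    """
    if not profile:
        raise ValueError("profile must not be empty")
    best_start, best_end, best_sum = 0, 0, profile[0]
    cur_start, cur_sum = 0, 0.0
    for i, u in enumerate(profile):
        if cur_sum < 0:                 # a negative prefix can only hurt
            cur_start, cur_sum = i, 0.0
        cur_sum += u
        longer = (i - cur_start) > (best_end - best_start)
        if cur_sum > best_sum or (cur_sum == best_sum and longer):
            best_start, best_end, best_sum = cur_start, i, cur_sum
    return best_start, best_end, best_sum


def stretch_to_residues(start, end, lpr_first_residue):
    """Convert tetrapeptide indices to the residue range they cover."""
    # Tetrapeptide i spans residues i .. i + 3 of the LPR.
    return lpr_first_residue + start, lpr_first_residue + end + 3


profile = [-1.0, -0.5, 2.0, 1.5, -0.2, 1.0, -2.0, -1.0, -0.5]
start, end, score = best_stretch(profile)
first, last = stretch_to_residues(start, end, lpr_first_residue=10)
```

For this toy profile the best stretch consists of tetrapeptides 2 to 5, which cover seven residues (positions 12 to 18 when the LPR starts at position 10). A stretch of one tetrapeptide therefore yields a four-residue linker, the shortest linker this representation can produce.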

2.4 Evaluation of proposed method 

It is of interest to evaluate the accuracy of the proposed method. In the
absence of a linker database, and because linker detection by visual
examination is highly subjective, we decided to extract experimentally reported
linkers from the literature. We first selected research papers based on the PDB
reference record of each protein in our input dataset. We then manually read the
literature to extract information about experimentally detected linkers. We
succeeded in extracting linker information for 51 proteins out of 725 proteins
in the input set (Supplementary Table 1). These linkers form the evaluation set
for the benchmark studies.

After demarcating the linkers using the proposed method, we compare them
with the literature-reported linkers. We compute the accuracy of demarcation
residue-wise as follows: if a residue marked as part of the linker is also part
of the literature-reported linker, we count it as a true positive; otherwise it
is counted as a false positive. If a residue is part of the literature-reported
linker but is not present in the linker marked by the proposed method, it is
counted as a false negative. Let TP denote the number of correctly demarcated
linker residues, FP the number of incorrectly demarcated linker residues, which
are actually non-linker residues, and FN the number of actual linker residues
that were not included in the demarcated linker region. Based on TP, FP and FN,
we compute the recall and precision of the proposed method as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

We also compute the F1 measure, which is the harmonic mean of precision and
recall, for the proposed method.
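
The residue-wise scoring can be sketched as follows. The function name `residue_scores` is hypothetical, and the sketch scores a single linker; how the paper aggregates counts over the 51 benchmark linkers (pooled counts versus per-protein averages) is not stated, so no aggregation is shown. The example uses the TnsA linker discussed in Section 3.3 (reported residues 165-170, demarcated residues 164-171).

```python
def residue_scores(predicted, actual):
    """Return (precision, recall, F1) for one linker, counted per residue."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)      # demarcated and reported
    fp = len(predicted - actual)      # demarcated but not reported
    fn = len(actual - predicted)      # reported but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1


# TnsA (1F1Z): reported linker 165-170, demarcated linker 164-171.
precision, recall, f1 = residue_scores(range(164, 172), range(165, 171))
print(precision, recall)   # 0.75 1.0
```

Here six residues are true positives and two are false positives, so the demarcation is complete (recall 1.0) but slightly too long (precision 0.75).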



3 Results 

3.1 Dataset preparation and structural profiling of LPRs 

We have selected 610 continuous multi-domain proteins from the ASTRAL 40 
[5] dataset version 1.69. Out of 610 selected proteins, we have 505 two domain, 
95 three domain and 10 four domain proteins. Based on the SCOP domain 
definition, we selected a stretch of 6 amino acids on either side of the domain 
boundary point to extract LPR of length 12 for each domain connection. Thus, 
we obtain 725 LPRs from the input protein domains. Each LPR is represented 
by nine overlapping tetrapeptides with an overlap of three residues between the 
consecutive tetrapeptides. Each tetrapeptide is represented with fifteen geo- 
metric invariants (GIs) as described earlier. Thus, we obtain 6525 tetrapeptides 
represented in 15 dimensional space spanned by GIs. This dataset is subjected 
to PCA, which reveals that the first 8 PCs cover 99% variance in the data. The 
tetrapeptides were then transformed into a reduced dimensional space spanned 
by first 8 PCs. The transformed dataset of tetrapeptides is subjected to hierar- 
chical clustering algorithm (Matlab implementation) . The resulting dendrogram 
was cut based on inconsistency parameter to obtain 2188 clusters. The distri- 
bution of clusters in terms of their size is shown in Table 1. Note that we obtain 
a large number of smaller clusters, approximately 50%, with size less than three 
members. The largest cluster contains 14 tetrapeptides. 

The larger clusters are assigned smaller e-values, while the smaller clusters 
are assigned larger e-values. We then constructed the structural profile of LPRs 
using the SUS of the corresponding tetrapeptides. The structural profiles, each 
of length nine, were subjected to a maximally scoring subsequence finding algo- 
rithm to demarcate the actual linkers. We were able to demarcate 692 domain 
linkers from 725 input LPRs. In the remaining 33 cases, we observed that these 
LPRs contain tetrapeptides with lower SUS. The distribution of linker lengths 
is shown in Table 2. We found that the average length of the linker detected by 
the proposed method is 5.3 residues. 
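
The counts reported in this subsection can be cross-checked with a few lines of arithmetic. The sketch only uses numbers stated in the text and in the tabulated linker length distribution; the variable names are illustrative.

```python
# Number of domains -> number of proteins, as reported in the text.
proteins_by_domains = {2: 505, 3: 95, 4: 10}
n_proteins = sum(proteins_by_domains.values())

# A protein with n continuous domains has n - 1 domain boundaries.
n_lprs = sum((n - 1) * count for n, count in proteins_by_domains.items())

k = 6                                   # residues taken from each side
tetrapeptides_per_lpr = 2 * k - 3
n_tetrapeptides = n_lprs * tetrapeptides_per_lpr

# Linker length -> linker count, from the linker length distribution table.
linker_counts = {4: 286, 5: 178, 6: 93, 7: 67, 8: 31, 9: 19, 10: 11, 11: 7}
n_linkers = sum(linker_counts.values())
mean_length = sum(l * c for l, c in linker_counts.items()) / n_linkers

print(n_proteins, n_lprs, n_tetrapeptides, n_linkers, round(mean_length, 1))
# 610 725 6525 692 5.3
```

The 610 proteins therefore contribute 725 LPRs and 6525 tetrapeptides, and the tabulated distribution accounts for all 692 demarcated linkers with a mean length of about 5.3 residues.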

3.2 Comparison with other methods 

We were interested in comparing the proposed method against state-of-the-art
methods to assess its performance. We used 51 literature-reported linkers for
the comparative analysis. The same set was used for evaluating the proposed
method. We selected the following methods for the comparison study: Ebina et
al. [12], GMQ3] and CHOP [25]. Note that a direct comparison is not strictly
appropriate, since most of these methods predict putative domain linkers from
sequence characteristics, while our method demarcates domain linkers by
analyzing structural characteristics of LPRs. Moreover, most of these methods
predict multiple putative linkers with a certain flexibility in the start and
end positions. From these predictions, we selected the most appropriate linker
based on the known domain definition and used it for the comparison.
Representative examples of the linkers identified by different methods on the
input set are reported in Table 3. The complete list can be obtained from
Supplementary Table 1. These predictions are matched with the actual linkers and
the accuracy is calculated in terms of the F1 score, which is the harmonic mean
of precision and recall. The comparative performance of the proposed method
is given in Table 4. The proposed method achieves an overall recall of 0.66 and
a precision of 0.83 on the benchmark dataset. It outperforms the
state-of-the-art methods in terms of the number of linkers identified as well as
the accuracy of the predictions.

Table 1: Distribution of clusters of tetrapeptides according to their size.

Linker Length     4    5    6    7    8    9   10   11
Linker Count    286  178   93   67   31   19   11    7

Table 2: Distribution of linkers according to their lengths. The linkers are
obtained via the proposed method.

We further compared our method against DomCut, which predicts the domain
cut point based on a domain linker index. Note that DomCut does not
predict the start and end positions of the linker. A DomCut prediction is
taken as correct if the predicted domain cut point falls within the
actual linker. Out of 51 linkers, we found that DomCut predicts correctly in 13
cases and does not predict a domain cut point in 14 cases. In the remaining
cases, the predicted cut point does not fall inside the actual linker.

We computed the agreement between our method and the linker database
of George and Heringa [HO [TS], which gives linker predictions for 79 proteins
in our input dataset. Note that their linker predictions are available for only
a few proteins from the evaluation set used earlier, and hence that set is not
used for the comparison between the two methods. We found that the predictions
partially agree in 55 cases and completely disagree in 24 cases. Reasonable
agreement (> 75%) was obtained in 5 cases, medium agreement (< 75% and > 40%)
was obtained in 29 cases, and weak agreement was observed in the remaining
cases.

3.3 Representative linkers 

Examples of linkers demarcated by the proposed method are shown in
Figures 2A-2J. Since the method uses the LPR for demarcation, the entire 3-D
structure of the corresponding domains is not shown. Instead, a stretch of 16
amino acids on either side of the domain boundary point is shown to maintain
clarity of representation. Here we describe a few representative linkers.

Examples of literature-reported linkers also demarcated by our method

1. Streptococcus pneumoniae SP14.3 (PDB Code: 1IB8)

Streptococcus pneumoniae is a deadly human pathogen causing high
mortality and morbidity rates [59]. SP14.3 is a key protein responsible for
the growth of the pathogen. The three-dimensional structure of SP14.3
contains a very short linker of size 3 between residues 88-90 with
moderate flexibility. Our method predicts the linker at exactly the same place
as reported by Yu et al. [49]. The linker plays a role in the relative
orientation of the domains and in maintaining their rotational correlation.

2. Yeast Sec18p (PDB Code: 1CR5)

Yeast Sec18p is a hexameric ATPase with a central role in vesicle
trafficking [3 . The reported linker is located between residues 104-113 and
is flexible in its structure. The SNAP binding site is located opposite to
the linker. The linker connects two beta-rich sub-domains and is likely to
facilitate different sub-domain orientations. Our method detected the linker
between residues 104-111, which is enclosed within the literature-reported
linker (Figure 2A).

3. TnsA (PDB Code: 1F1Z)

TnsA carries out DNA breakage at the 5' end of the transposon. It contains a
six-residue loop linker between residues 165-170 that connects two domains
of the homodimeric endonuclease enzyme [20]. Our method predicted an
eight-residue linker between residues 164-171 (Figure 2B). The linker is likely
to play a role in cooperative domain binding.
Examples of novel linkers that are not reported in the literature

Figure 2G shows the novel linker in Ornithine transcarbamoylase (1DUV), which
is not reported in the literature. Fascin (1DFC) has three domains, and the two
linkers delimiting these domains were demarcated accurately (Figures 2H and
2I). Spectrin beta chain (1S35) is an all-alpha protein with an alpha-helical
linker between two domains (Figure 2J). A helical conformation of the linker
region is compatible with a variety of different twist angles between Spectrin
repeats, leading to many different conformations.

4 Discussion 

We proposed a novel objective method for accurate demarcation of linkers. The 
method takes three dimensional protein structure and domain definition as an 
input and provides accurate linker demarcation. This is the first instance where 
structural aspects are rigorously analyzed in the linker demarcation task. The 
earlier methods have used biochemical and sequence properties for the same 
task. Since the proposed method provides structural perspective in the demar- 
cation, it can be used in tandem with the other methods reported in literature. 

As stated earlier, accurate domain linker demarcation is key to
understanding the biochemical properties of linkers. Given a three-dimensional
structure and its domain definition, the linkers can be demarcated either
through direct visualization or through objective automated methods reported in
the literature [H3 [23 HI [TTJ . Visualization methods are often subjective,
while automated methods demarcate linkers only approximately. The proposed
method demarcates the linker more precisely than the other methods, as
demonstrated on the benchmark dataset, where it outperforms them with an F1
score of 0.745 (precision 0.83 and recall 0.66).

The method is also the first of its kind in exploiting biological knowledge
about the structural uniqueness of linkers. Since the linkers possess a flexible
structure, their constituent fragments are unique and appear as outliers during
the clustering process [J7j. Since the outliers are assigned the maximum SUS,
the stretch of fragments with the maximum cumulative SUS corresponds to the
actual linker. The discovery of such a stretch was performed using an efficient
linear-time dynamic programming algorithm.

The proposed method has the following configurable parameters, which can be
adjusted to achieve the desired results: (i) k, which affects the length of
LPRs, and (ii) the length of the peptide fragment. The current study uses LPRs
of length 12, the tetrahedron as the choice of local structure, and clusters of
tetrahedrons in LPRs. The length of the LPR was decided based on prior reports
of average linker lengths [16] [26] [39]. Flexibility can be added to LPR
selection with the help of other methods reported in the literature. For
instance, we can use Taylor's method [H] as applied by [TB] to come up with more
appropriate LPRs. The LPRs thus obtained can be processed further with the help
of the proposed method to demarcate accurate linkers. In the present study, we
extract tetrapeptides from LPRs and perform the clustering. The clusters are
then assigned SUS based on the cluster size distribution. Our method is able to
find linkers with irregular or unique structure. It is also able to detect
linkers containing alpha-helices and beta-strands with structural
perturbations. Due to these perturbations, the corresponding tetrapeptides are
part of smaller clusters, which are often assigned higher SUS, and hence our
method is able to detect linkers containing such structures. However, our
method is unable to detect linkers made up of regular alpha-helices and
beta-strands, since regular structures tend to form larger clusters and usually
have smaller SUS compared to irregular structures.

Finally, we plan to construct a database of linkers demarcated via the pro- 
posed method. The database will help to further our understanding of bio- 
chemical properties of linkers and help to design better linkers while engineering 
multi-domain proteins. We are planning to use insights about linkers from this 
work to develop a sequence based linker prediction method. This can be useful 
in predicting protein domains by virtue of linker prediction. 
Figure 1: Schematic for protein linker demarcation: (i) Based on the SCOP domain definition,
we identify the domain boundary point in the protein structure. A twelve residue long linker
probable region (LPR) is carved out by taking six amino acids each from either domain, (ii) 
LPR is represented with nine overlapping tetrapeptides. (iii) Each tetrapeptide is described
with a set of 15 geometric invariants (GIs). The geometric invariants are standardized to 
zero mean and unit standard deviation. The number of dimensions is reduced via principal 
component analysis (PCA). We select first 8 PCs to represent tetrapeptides. The tetrapep- 
tides are then transformed into PC space, (iv) The hierarchical agglomerative clustering is 
performed to identify clusters of similar tetrapeptides. The clusters are assigned E-values 
based on the cluster size distribution. The E-values are standardized to zero mean and unit 
standard deviation, yielding structural uniqueness scores (SUS). (v) Based on the membership 
of a tetrapeptide to a particular cluster, we construct LPR structural profiles using SUS of 
the respective clusters, (vi) We perform maximally scoring subsequence discovery on top of 
LPR structural profile to identify the continuous stretch with maximum cumulative SUS. This 
stretch of tetrapeptides corresponds to the actual linker. 
Table 3: Representative examples of linkers extracted by the proposed
method. We also show the actual linker as extracted from the literature as well as the
linkers predicted by state-of-the-art methods. The column Lit. Ref. provides the literature
reference for the actual linker.
Table 4: Performance of various methods on the benchmark dataset of 51 linkers
Figure 2: Different categories of substructures predicted as linkers. The segment of the
N-terminal domain is shown in cyan whereas the segment of the C-terminal domain is shown
in blue. The segment of the structure demarcated as a linker is shown in red. (A) 1CR5
[N-terminal domain of Sec18p], helical linker [LEU 104 - GLN 111]; (B) 1F1Z [TnsA
endonuclease], linker [THR 164 - VAL 171]; (C) 1VI7 [Hypothetical protein YigZ], beta linker
[THR 133 - PRO 138]; (D) 1P2F [Response regulator DrrB from Thermotoga maritima], loop
linker [GLU 118 - GLY 121]; (E) 1R89 [tRNA nucleotidyltransferase], loop linker [GLY 139
- GLY 143]; (F) 1DT9 [Eukaryotic peptide chain release factor subunit 1], novel linker [LEU
140 - SER 144]; (G) 1DUV [Ornithine transcarbamoylase], novel linker [LEU 148 - ALA 152];
(H) 1DFC [Fascin], loop linker between domain 1 and domain 2 [HIS 1135 - GLN 1141]; (I)
1DFC [Fascin], loop linker between domain 2 and domain 3 [SER 1259 - GLN 1262]; (J) 1S35
[Spectrin beta chain], helical linker [THR 1163 - PHE 1170]. The figures are prepared using
PyMOL (http://www.pymol.org).